Advancing research on compound weather and climate events via large ensemble model simulations

Societally relevant weather impacts typically result from compound events, which are rare combinations of weather and climate drivers. Focussing on four event types arising from different combinations of climate variables across space and time, here we illustrate that robust analyses of compound events — such as frequency and uncertainty analysis under present-day and future conditions, event attribution to climate change, and exploration of low-probability-high-impact events — require data with very large sample size. In particular, the required sample is much larger than that needed for analyses of univariate extremes. We demonstrate that Single Model Initial-condition Large Ensemble (SMILE) simulations from multiple climate models, which provide hundreds to thousands of years of weather conditions, are crucial for advancing our assessments of compound events and constructing robust model projections. Combining SMILEs with an improved physical understanding of compound events will ultimately provide practitioners and stakeholders with the best available information on climate risks.

Most environmental impacts result from combinations of multiple weather and climate drivers that are referred to as compound events [1][2][3] . Given the complexity of the climate system, such events can occur in multiple ways, for example involving extreme conditions in different variables happening simultaneously or across space and time 4 . For instance, in 2014/2015, Rio de Janeiro (Brazil) was hit by concurrent extremely hot and dry summer conditions, which led to wide-ranging socio-economic repercussions that exceeded the impacts that would have been caused by heat and dryness in isolation 5 . The hot-dry weather favoured widespread wildfires as well as water scarcity that impacted coffee production 5 . Furthermore, excess fatalities due to a severe dengue fever outbreak were linked to the increase in water storage tanks installed by the population to mitigate the drought 5 . In a similar fashion, impacts from hurricanes typically arise from concurrent hazards via precipitation-driven flooding and winddriven hazards such as storm surges, as happened when Hurricane Ida led to destruction in Louisiana (United States) in 2021, causing about 30 fatalities 6 . Extreme impacts can also arise from sequences of weather events that increase system vulnerability or lead to disproportionate damages, as exemplified by ecosystem 7 and agricultural 8 impacts from the two European drought years in 2018 and 2019, and by the critically reduced water storages in South Africa due to a sequence of three dry winters in 2015-2017 9,10 . Furthermore, combinations of extremes occurring simultaneously (or almost simultaneously) across multiple regions can cause extreme impacts to connected global systems 11 . For example, simultaneous soil moisture droughts across multiple breadbaskets can reduce global food production and threaten food security [12][13][14][15] .
Compound events are thus characterized by complex multivariate relationships. Recognizing that a univariate perspective on hazards may severely underestimate risk 16 led to a rapidly growing body of literature on different types of compound events 3,4,17 , including literature on interconnected, interacting, and cascading risks 18 . However, the limited temporal length of observations and routinely used model simulations often limits the possibility of identifying and studying the sparse multidimensional constellations of climate variables constituting compound events, an issue related to the curse of dimensionality 1,[19][20][21] . Combining routinely used single-member simulations of multiple climate models, such as the standard multi-model ensembles from the coupled model intercomparison project (CMIP) 22 , offers larger sample sizes but leads to confounding irreducible uncertainties arising from internal climate variability (the inherent chaotic nature of climate) and from potentially reducible uncertainties arising from structural differences across models [23][24][25][26] . Robustly distinguishing between these uncertainties is important, however, as it can ultimately help reduce uncertainties from structural model differences 25,26 .
In this perspective, we argue that using output from climate Single Model Initial-condition Large Ensembles 24 (SMILEs) should become standard for studying and projecting compound events. A SMILE consists of many simulations (i.e. ensemble members) from a single climate model based on the same model physics and under the same external forcings, but each starting from slightly different initial states. Thus, each realization in the ensemble evolves differently solely due to internal climate variability. Combining the multiple members of a SMILE ensures therefore a much better sampling of the data sparse regions associated with compound events compared to traditional model simulations 27 . Using a combination of different SMILEs then allows for identifying model differences.
Here we propose six important topics in compound event research and illustrate associated challenges. Each of the following six sections provides information on the importance of one of the six topics and demonstrates, based on our analyses and the literature, both the challenge of addressing the topic and how using SMILEs can be of help in this context. Specifically, we demonstrate that SMILEs are important to (i) robustly assess compound event frequencies and associated uncertainties; (ii) quantify the extent to which anthropogenic climate change has contributed to observed compound events; (iii) provide robust quantification of historical and future changes in compound events; (iv) quantify and explore uncertainties in future projections, including distinguishing between uncertainties arising from internal climate variability and structural model differences; (v) identify most dangerous low-likelihood compound events, advancing their physical understanding, and exploring associated risks; and (vi) evaluate the skill of advanced statistical models developed to improve the assessments of compound events. Via analyses related to four different compound event types, we illustrate how these topics can be addressed using SMILEs. The considered compound event types are compound hot-dry warm-season events, concurrent precipitation and wind extremes, multi-year soil moisture droughts, and concurrent soil moisture droughts across global soybean breadbaskets. See Table 1 for a comprehensive list of the six research topics and the associated analyses related to different compound events.

Estimating compound event likelihoods and associated uncertainties
An important challenge of climate studies is assessing the likelihood of extreme events 26 . This information ultimately serves decision-makers, international development agencies, and insurances to develop preparedness for climate impacts 3 . Employing the approach of Bevacqua et al. 26 and focussing on the historical (1950-1980) frequencies of compound hot-dry warm seasons (f HD ), concurrent daily precipitation and wind extremes (f PW ), and multi-year (i.e., consecutive annual) droughts (f MYD ), we illustrate that using a large sample size is essential for robustly estimating compound event probabilities. Using 3100 years of data from 31-year periods of the 100 ensemble members of the MPI-GE climate model (40 members of the CESM1-CAM5 model for f PW ) allows for robustly identifying hotspot areas where compound event frequencies are higher than what would be expected when assuming that the compound drivers are statistically independent (see unstippled regions in Fig. 1a-c).
Importantly, if such frequencies were estimated based on much smaller sample sizes contained in typically available model simulations and observations, they would be highly uncertain and could ultimately lead to misleading risk estimates. For example, based on 31 years of data from a single ensemble member, someone interested in compound hot-dry event risk over Southern Asia could obtain regionallyaveraged f HD ranging from 2% to 6% (centred 90% range across the ensemble members), i.e. differing by a factor of 3 ( Fig. 1d, g; note that frequencies at the local scale can even range from about 0% to 10%). That means, depending on the data sample selected at random, a wide range of possible risks may be estimated. Similarly, f PW estimates could differ by a factor of 2 on average over Portugal, while f MYD by a factor of 12 over Central North America. Hence, based on a short sample of data, in principle, a low compound event risk may be estimated over regions that are, instead, at high risk. This sampling uncertainty stems from the fact that 31-year periods are, in this case, highly insufficient to sample internal climate variability, resulting in a wide range of possible frequency estimates across single ensemble members 26 . Similarly, this sampling uncertainty is also inherent in estimates derived from short observational records. SMILEs allow users to obtain a robust modelbased estimate of the compound event probability (i.e., the frequency computed from many years of data obtained by pooling multiple ensemble members) and allow for a quantification of how uncertain compound event frequency estimates are based on a given sample size. The latter can be estimated as the range of frequencies across single ensemble members. The differences in compound event frequencies across different ensemble members are likely to partially arise from fluctuations in climate modes of variability that influence regional weather over decadal scales. For example, Atlantic Multidecadal Variability modulates hot and dry conditions in Southern Asia 28 and droughts in Central North America 29 . SMILE-based analyses can help to assess the influence of the modes of variability on the compound events 12 .
It is important to note that the uncertainties in compound event estimates are substantially larger than for univariate extremes independently of the considered sample size, highlighting that employing a large sample size is much more important for compound than for univariate events. This is exemplified by the relative uncertainty in the compound frequency, which is defined here as the ratio of twice the standard deviation to the mean of the frequency across ensemble members 26 . Based on 31 years of data, this relative uncertainty is 2.4 and 1.1 on average over land for compound hot-dry events and univariate hot events, respectively (see continuous lines in Fig. 2a). To reach a relative uncertainty of 1.1 for compound events, 132 years of data would be required (see arrow in Fig. 2a). The results further highlight that the relative uncertainty in the compound frequency is higher at locations where the coupling between the compound drivers is weaker (compare dashed and dotted lines in Fig. 2a), that is, when the statistical dependencies make the compound event rarer, therefore more difficult to sample.
The larger uncertainties for compound than for univariate events holds for very different event types, including compound precipitation and wind extremes (Fig. 2b), despite the fact that the daily scale of precipitation-wind events yields a much larger sample size (compared to compound hot-dry warm-season events). In line with the curse of dimensionality, when we consider multi-year droughts, the relative uncertainties increase with the dimensionality of the considered compound event (Fig. 2c). Hence, employing large sample sizes is even more important for estimating the occurrence of higher-dimensional compound events such as multi-year floods, co-occurring fire drivers 30,31 , concurrent droughts in multiple regions 13 , and spatially extended floods 32 . While arguably our examples are somewhat extreme by focusing on a sample size as small as 31 years, the issues remain for larger sample sizes (Fig. 2). Furthermore, we note that it is not uncommon in the literature that probabilities of compound events are estimated based on such small sample sizes [33][34][35][36][37][38] , including probabilities of high-dimensional events 13,20,30,39 . Shorter periods than 31 years are often used for projections at fixed global warming levels, e.g. in reports of the Intergovernmental Panel on Climate Change 3 .

Compound event attribution
In the aftermath of a high-impact weather event, the extent to which anthropogenic climate change has contributed to the event is regularly questioned 40 . Addressing this attribution question is important to gain a better understanding of how climate change has affected and might affect extreme events and their impacts 41 . One way to approach the question is to frame it in terms of a probability ratio PR, defined as the ratio of the probability of an extreme event occurring under current conditions (factual world) to the probability that the event occurs in a world without anthropogenic climate change (counterfactual world) 42 . Such probabilities can be estimated using climate model simulations, for example by comparing runs simulating both worlds 43 , or by using approaches based on non-stationary extreme value theory, where the occurrence of extreme events is fitted against a covariate such as global mean temperature 44 . Values of PR > 1 imply that climate change contributed to a certain event occurrence; for instance, PR = 2 implies that climate change made a certain event class (typically defined as all events exceeding a critical threshold) two times more likely. So far, attribution studies have mainly treated events as univariate, howeverbecause many climate-related impacts are caused by a combination of multiple drivers-studies on compound event attribution are emerging [43][44][45][46] .
As attribution studies typically focus on very extreme events, they are often based on large ensemble climate model simulations. For example, the weather@home experiment, a large ensemble of simulations of the regional atmosphere-only climate model HadRM3P 47 has been widely used by the attribution community. Other studies have relied on the multi-model CMIP5 and CMIP6 ensembles to gather enough data to isolate any climate change forced signal from internal climate variability noise, though the ability to disentangle internal variability from model differences ultimately depends on the number where the compound event ensemble-mean frequency is smaller than expected in a reference case that assumes independence between the compound drivers (here, independence between average temperature and precipitation during the warm season). d, g f HD of the ensemble members associated with the 5th and 95th percentiles amongst all the members, respectively, of the f HD averaged over the region in the green box. b, e, h As in panels a, d, g, but for concurrent daily precipitation and wind extremes (f PW ) and based on the CESM1-CAM5 model. c, f, i As in panels a, d, g, but for the frequency of three consecutive annual soil moisture droughts (f 3YD ). Univariate extremes were defined via percentile-based thresholds, i.e. hot and dry seasons occurring every 10 years on average, precipitation and wind events occurring twice a year, and soil moisture droughts every 5 years (see Methods). of runs available for each model 24,48 . The required ensemble size to estimate internal variability, which is needed to provide robust attribution assessments, depends on several factors, such as the spatiotemporal scale of the event, the variable, and the amplitude of the forced trend 49 .
Here, we demonstrate that for compound event attribution, larger sample sizes are required than for univariate event attribution, in particular if the drivers are not strongly correlated and have trends of similar magnitude. We build on the synthetic data approach developed by Zscheischler and Lehner (2022) 43 . For the sake of simplicity, we focus on events that increase in probability with climate change (Fig. 3a). That is, we calculate the required sample size for the attribution of concurrent high extremes of climate variables that show positive trends due to anthropogenic climate change such as coastal flooding followed by a heatwave 50 . We define the required sample size as the number that guarantees correctly identifying PR > 1 in the presence of a forced trend, assuming no trend in the dependence between drivers 43 (see Methods). In general, very large sample sizes are needed to disentangle small climate change driven trends from internal climate variability (Fig. 3b). We find that when trends in the drivers are comparable, attributing compound events requires a larger sample size than that needed for attributing any of the contributing univariate events (see stippling in Fig. 3b). In particular, a substantially larger sample size for compound than for univariate events is needed for smaller forced trends (Fig. 3c). For example, the sample size needed under identical trends of 0.4 ⋅ σ i in the drivers differs by about 330 (i.e. by a factor 2) between the bivariate (under zero correlation) and the univariate cases (Fig. 3c). Furthermore, larger sample sizes are required when the dependence between the compound event drivers is small or negative (Fig. 3c), i.e.-in general-when the dependence makes the compound event rare. For example, the sample size needed under identical trends of 0.4 ⋅ σ i in the drivers differs by about 1200 (a factor 3.5) between the bivariate case under −0.5 and +0.5 correlation (Fig. 3c). When the trends in the compound variables differ strongly in magnitude, attributing the compound event requires a sample size in between what is needed for attributing the univariate events in isolation (missing stippling in Fig. 3b). That is, for example, attributing a compound hot-dry event to anthropogenic warming trends requires a larger sample size than that needed for attributing the heat event in isolation, though smaller than for attributing the drought event alone given that trends in droughts are much weaker than trends in heatwaves in most regions 3 .

Detecting changes in compound event characteristics
Internal climate variability arising from the chaotic nature of the climate system can obscure anthropogenic climate change signals within The large orange dot indicates the combination of trends displayed in a. Stippling indicates combinations of Trend Y and Trend X for which attributing the compound event requires a larger sample size than that needed for attributing any of the underlying univariate events (elsewhere compound attribution requires a sample size in between that needed for the two univariate events in isolation). c Sample size required for attributing any of the two univariate events (orange), and the compound event (green) assuming equal forced trends for both drivers and different constant correlation (ρ) between the drivers (see legend). Fig. 2 | Dependence of the uncertainty due to internal climate variability on the sample size. a Relative uncertainty due to internal climate variability in the frequency of univariate hot events f H (orange) and compound hot-dry events f HD (solid green). Dashed and dotted green lines show the relative uncertainty in f HD for grid points with the 35% strongest and weakest coupling, respectively (see Methods). Relative uncertainty is defined as the ratio of twice the standard deviation to the mean of the frequency f across ensemble members averaged over global landmasses (see Methods). b As in a, but for the frequency of wind extremes (f W , orange) and concurrent precipitation and wind extremes (f PW , green). c As in a, but for the frequency of a single-year soil moisture drought (f 1YD , orange), two (f 2YD , green) and three (f 3YD , purple) consecutive annual soil moisture droughts, respectively. The climate models and the definitions of the extremes are the same as in Fig. 1. short time periods 51 . Hence, especially for low-probability events, robust sampling of internal climate variability under different global warming states is crucial for assessing anthropogenically-forced changes robustly 51 . For compound events, identifying anthropogenically-forced changes requires, in addition to considering changes in the individual compound event drivers, to also quantify changes in their coupling, i.e. their statistical dependence 32,52-54 . However, quantifying any potential anthropogenic signals in the dependence, and therefore understanding the processes driving such signals, has proven to be particularly challenging so far due to the limited sample size of considered datasets 33,34,38,52,54 . For example, for compound coastal flooding-which is often driven by concurrent wind and precipitation extremes 54 -Wahl et al. 52 identified an increase in the dependencies between the drivers in the historical period over the United States' coast, but it was not possible to attribute this increase to anthropogenic climate change. Bevacqua et al. 54 suggested that internal climate variability may dominate such dependency changes even under high greenhouse gas emissions and that SMILEs may therefore be necessary for identifying any potential forced changes. The SMILE experimental design allows for a quantification of the forced response (including changes arising from mean and variability 55 of individual compound drivers, as well as from their dependence), by averaging changes across ensemble members, hence filtering out spurious trends due to internal climate variability. Once SMILE simulations allow us to disentangle such changes, further analysis can be carried out to identify the physical mechanisms driving these changes.
Focussing on compound hot-dry events, previous studies based on single-member simulations highlighted disagreement among model runs on the sign of the changes in the statistical dependence between temperature and precipitation 53 , ultimately causing large uncertainties in impact assessments. In particular, changes in this dependence are expected to modulate the heat sensitivity of crops as temperatures rise, thereby affecting crop production 56 . We illustrate that such changes in the dependence can be robustly estimated using the large ensemble simulations of the MPI-GE model as an example (Fig. 4a). Highly misleading results in the strength and sign of the changes may be obtained based on short time periods (31 years) from a single model simulation (note that this sample size is similar to that used in previous studies [33][34][35]38 ). This is due to the large uncertainty due to internal climate variability in the dependence change (Fig. 4b, compare with Fig. 4a). Thus, using model output from only one model simulation does not even guarantee to identify the correct sign in the projected change. For example over Central Europe, one may identify either a positive or negative change (Fig. 4c, d) despite the clear underlying negative forced change (Fig. 4a).
Similar considerations apply to the overall changes of compound event occurrence (not just changes in statistical dependence of drivers), as exemplified for trends in spatially compounding events. A previous study, by comparing two recent adjacent 20-year observational periods, indicated an increase in the probability of multiple breadbasket failures for wheat, maize, and soybean, and a decrease for rice 13 . However, model simulations (MPI-GE model) indicate a large variability in the frequency of a similarly defined 5-dimensional compound event, i.e. concurrent annual soil moisture droughts across the five main soybean breadbaskets worldwide (Fig. 5a), when estimated based on just 31 years of data (derived from a single ensemble member) (see blue bars in Fig. 5b). As a result, even two consecutive 31-year periods under the same historical conditions could exhibit a large increase or decrease in this frequency solely due to internal variability (grey bars in Fig. 5c). This indicates the challenge of attributing detected changes in spatially compounding events to anthropogenic climate change without employing large sample sizes. Notably, the potential for detecting both positive and negative changes persists even when considering projections in a world 2°C warmer than pre-industrial conditions (brown bars in Fig. 5c, and Fig. 5b), in which case forced changes are expected to be much larger than in the historical period. Here, large ensemble simulations allow for robustly identifying a future forced response. For instance, by merging one-hundred ensemble members of 31 years, the change in the probability of N concurrent regions under droughts (lines overlaid to brown bars in Fig. 5c), is statistically significant (95% confidence level) up to N = 4. Uncertainty sources in projections and plausible climate storylines for impact modelling Quantifying and understanding uncertainties in climate projections and identifying the potentially wide range of plausible climate futures is highly relevant for policymakers 25,57,58 . It is important to distinguish between uncertainty arising from internal climate variability, which is irreducible because of the inherently chaotic nature of the climate system, and uncertainty from structural differences amongst climate models, which could be reduced through, for instance, model improvements or emergent-constraints 59 . Uncertainty partitioning based on multi-model single-ensemble simulations 23 is expected to be biased, especially at regional scales and for variables with large variability 60 , which includes compound event frequencies. Here, we illustrate how multiple SMILEs can be used to partition uncertainties in compound event projections following recently proposed approaches 25,26 (see Methods).
For compound hot-dry events, we demonstrate that contributions from internal climate variability and structural model differences to uncertainties in compound event projections depend on the warming level, with the two sources broadly contributing comparably in a 2°C warmer world (Fig. 6a) and model differences dominating in a 3°C warmer world (see purple dominating in Fig. 6b). Uncertainty in climate projections (i.e. in the climate over a period of, for instance, 30 years) can be communicated via climate storylines, which are distinct plausible future climates 61 that can be derived from large ensemble climate model simulations 26,62,63 . For example, one storyline could be derived from a model simulation that exhibits an increase in precipitation over a region of interest, while another from a simulation that exhibits a precipitation decrease. Note that climate storylines may also be used as input to impact models to allow for an efficient exploration of the full range of plausible future climates and associated impacts (see final discussion).
Distinct climate storylines derived from different ensemble members of a single SMILE 26,62 allow for communicating "certain" irreducible uncertainty stemming from internal climate variability. In fact, the robust sampling of internal variability offered by the large number of ensemble members of a SMILE allows for identifying very extreme, albeit plausible future scenarios that may lead to the largest impacts 26,61,62 . An illustration of climate storylines with different compound hot-dry event frequencies in a 2°C warmer world over Southern Africa is provided in Fig. 6d, g. When uncertainties from model differences dominate, as it happens in a 3°C warmer world in this example, climate storylines derived from different models can be used to communicate such uncertainties transparently (Fig. 6e, h). In principle, such uncertainty from model differences could be reduced by improving the representation of the processes driving compound event changes. For example, in Southern Africa, improving the representation of processes dominating precipitation uncertainties, in particular related to shifts in the convective region and the structure of sea surface temperature in the surrounding oceans 64 , is essential to reduce uncertainties in future compound risk 26 .

Event-based storylines
The largest socioeconomic impacts are often caused by individual very rare events 65,66 . Estimating the occurrence probabilities of such rare events is very challenging, but a better understanding of the events' dynamics is critical for risk analyses. This can be done using an eventbased storyline 67 , which represents a physically self-consistent unfolding of a plausible individual event. The concept of event-based storyline differs from that of climate storyline introduced above, as the former refers to a plausible individual event (e.g., a very intense storm or crop failure and its consequences), and the latter to a plausible future climate over a long period (e.g. 30 years). The concept of eventbased storylines has been employed to study several types of rare highimpact events over recent years [68][69][70][71][72] . Event-based storylines can be used to improve our understanding of the physical drivers of very extreme events and, through collaborations between climate scientists and impact modellers, the event's socio-economic consequences 66 . However, identifying rare high-impact events that typically consist of combinations of multiple climatic drivers from short samples of data is difficult 73,74 . For example, based on datasets with limited sample size, identifying and studying the potential for combinations of physical drivers that lead to record-shattering heat events can be difficult 73 , and therefore the threats of such rare but plausible extreme events could be neglected.
In this context, the large sample of atmospheric conditions obtained by pooling multiple ensemble members of a SMILE can be used to identify very rare and potentially high-impact events that are plausible but have not yet occurred in the observational record (socalled unseen events 73,75,76 ), which is particularly useful for compound events as we illustrate with two analyses. To quantify rarity in a multidimensional space we rely on a copula-based approach 77,78 by selecting events that maximise the copula describing the distribution of the compound event drivers (see Methods).
The first analysis concerns compound precipitation and wind extremes over Portugal. Using all available simulations from multiple ensembles of the model CESM1-CAM5, we find that much more extreme events are possible compared to the most extreme event identified in a single simulation of that model (Fig. 7a). Specifically, based on the first ensemble member we identify as the most extreme event an extratropical cyclone with a central pressure below 990 hPa, which causes wind extremes along the Portuguese coast, but virtually no heavy precipitation (Fig. 7b). Based on all simulations the most extreme event is a deep cyclone with a core pressure below 970 hPa, . c Projected change in probabilities of simultaneous droughts in a 2°C warmer world relative to preindustrial conditions (brown) and variations between two historical periods due to internal climate variability (grey; 95% range of changes obtained from randomly paired historical ensemble members).
which causes extreme wind speeds and daily precipitation above 100mm in Portugal (Fig. 7c). Such an event may cause a combination of inland flooding and landslides 2 , coastal storm surges 54 , uprooted trees, and downed power lines 79 . In a second analysis, we search for concurrent droughts in five key soybean production regions using data from the model MPI-GE. The most extreme event based on all pooled ensemble members is associated with droughts that are characterized by (sometimes substantially) lower soil moisture over all regions than based on the first ensemble member alone (Fig. 7d). Here, in line with a recent SMILE-based analysis 12 that revealed the contribution of oceanic modes of variability on concurrent droughts across multiple breadbasket regions, further analyses could help identify plausible oceanic precursors of worst-case spatially compounding droughts. In general, the simulated worst-case events should be considered plausible only if the driving processes are in line with basic physical principles 80 .

Developing and evaluating statistical tools for compound events analysis
Properties of compound events are often studied via statistical modelling 2,16,20,81,82 . For example, statistical weather generators, which are used to simulate realistic weather conditions, can-in principleallow sampling unobserved extreme events and estimating the likelihood of very rare events, reducing issues related to the limited sample size of observations 77 . To meet the challenge of studying different aspects of compound events, new statistical methodologies of different levels of complexity are being developed 2,20,83-86 . However, novel statistical tools need to be tested and evaluated, and the limited sample size of observational records is a constraining factor in this regard.
Building on a perfect-model approach (introduced below), SMILEs offer a powerful testbed for new and existing methodologies 24,75 . This can be particularly useful for assessing the skills of novel methods in compound event research, which can be very complex given the need for modelling multiple inter-variable relationships and multivariate extremes 87,88 . In a perfect-model approach, the climate represented by the large ensemble simulations of one climate model is used as a testbed. The statistical model is calibrated to the data of a single ensemble member, representing the pseudo-observations that are limited in sample size. Then, the skill of the statistical model in representing the statistics of interest in the underlying climate can be For example, one could be interested in evaluating a multivariate statistical model that is fitted to 31 years of soil moisture observations across the five soybean breadbasket regions with the goal to estimate the probability of simultaneous droughts in the five regions. We illustrate using the perfect-model approach (Methods) that a statistical model based on pair-copula-constructions, a statistical tool that is widely used for high-dimensional compound events 13,20,89 , underestimates the probability of two regions under drought in favour of a possible overestimation of four regions under droughts (Fig. 8; the correct probability shown by the black line in the middle of the blue bar is outside of the sampling uncertainty in grey of the statistical model). The statistical model does also not represent the statistics of the data used for calibration (the red lines do not always fall inside the grey range), which in principle may be due to structural biases in the chosen statistical model, indicating the challenge of developing multivariate statistical models for compound events based on limited sample sizes and the advantage offered by the large sample size of SMILEs. While it seems unwise to estimate high-dimensional probabilities from limited sample sizes, this is not uncommon in the literature 13 . Finally, note that the uncertainty in the frequency of concurrent droughts arising from internal climate variability (blue bars; the range of frequencies that can be measured in a given 31-year period) is conceptually different from the sampling uncertainty from the statistical model (grey bars; an uncertainty around the frequency averaged across multiple 31-year periods), hence-based on theory-a match between these uncertainties should not be expected. Finally, we note that a user may employ a perfect model approach to test different types of statistical models and eventually choose the most appropriate one.

Discussion
While large ensemble simulations are known to be important to study very extreme univariate events 24,73,90,91 , we demonstrated that such simulations are even more relevant for compound event analysis. In principle, when studying mild extremes under stationary climate conditions, pooling data from observations or a historical single ensemble member could increase sample size and partially reduce uncertainties. However, for studying very extreme compound events and non-stationary climate conditions, very large sample sizes from multiple ensemble members are essential 14,26,32,[92][93][94][95] . Furthermore, climate models are an essential tool for investigating future changes in compound events. We recommend using SMILEs to avoid estimating potentially incorrect compound event frequencies and associated changes. SMILEs are especially important when the correlations between the compound drivers are weak and for very high dimensional compound events, that is, when identifying compound event characteristics becomes particularly difficult based on short sample sizes.
As we illustrated with several analyses, the large sample sizes of SMILEs allow for tackling the challenge of robustly estimating statistical features of compound events. However, we note that-as for virtually all of the research related to climate and weather-research gaps do not arise merely from statistical challenges. It is important to advance our understanding of the physical processes behind compound events and their changes via in-depth analyses that combine models and long observational datasets. The large sample size of SMILEs can help advance the physical understanding of compound events. For example, a SMILE-based analysis revealed that a relatively small percentage increase in precipitation intensities caused by a combination of dynamic and thermodynamic atmospheric changes will drive a disproportionately larger increase in the spatial extent of precipitation extremes 32 . Similarly, SMILEs helped reveal that the concurrence of El Niño with the cold phase of the Tropical North Atlantic can substantially enhance the probability of concurrent droughts over global breadbasket regions 12 . Such climate-model-based analyses should be carried out side by side with a climate model evaluation of the primary processes responsible for compound events (see discussion below), e.g. the climate modes of variability.
SMILEs can also help advance our understanding of very extreme events. For example, climate model simulations helped uncover the physical drivers of record-shattering hot events 73,96 . Given that climate models can have biases, very extreme simulated compound events should be considered plausible only if the identified driving processes are consistent with basic physical processes 80 . Models that are unable to capture events as extreme as those observed even with the improved sampling provided by SMILEs are likely affected by biases in their representation of real-world processes.
SMILEs further allow for robustly quantifying forced changes in compound event features, including the dependencies between compound drivers. In fact, large changes may appear when analysing data with limited sample size, although no forced trend may exist. This effect, which also exists for univariate quantities 51 , is much enhanced for compound events. Once forced changes are identified, they can be linked to their driving physical processes. For example, SMILEs can help disentangle the physical mechanisms behind dependence changes, e.g. it could be studied whether the strengthening of landatmosphere feedbacks 97 strengthens temperature-precipitation dependencies 53 , whether projected changes in storm tracks 54,98 influence the dependence between precipitation and wind extremes, and how changes in dependence scale with global warming.
While SMILEs offer many advantages compared to single-member climate model simulations, a series of considerations are worthwhile. SMILEs have already helped substantially to advance our understanding of climate variability 24 ; however, climate models can have biases that should be considered carefully. That means climate model skills need to be evaluated 43,86,99 . Assuming that any difference between model-and observation-based estimates is only due to biases can be misleading as, especially for multivariate relationships, large differences may arise from internal climate variability 43,54 . That is, climate models with a few or even one single member may fail to capture observed behaviours not because of model biases but simply because of insufficient sampling of internal climate variability. The improved sampling of internal climate variability in SMILEs allows for better model evaluation 100 -specifically, it allows for avoiding rejecting models for the wrong reason. In general, we highlight that a purely statistical assessment of the model skills in representing the ultimate quantity of interest (e.g., frequency of extreme events) should always be complemented by a process-based approach that evaluates the model representation of the driving processes of extreme events 72,80,101 and therefore the physical realism of the simulated events. Here, we argue that the latter, process-based oriented evaluation is particularly relevant for compound events. In fact, when observations are very short in sample size, while SMILE-based model evaluation can avoid rejecting models for the wrong reason, it may be very difficult to disentangle model biases from noise arising from internal climate variability in highly uncertain statistical quantities such as the frequency of compound events.
Models that pass a process-based climate model evaluation may still have residual biases 102 . Hence, bias adjustment or downscaling methods are operationally used for impact assessments 102 . For compound events, multivariate downscaling methods 103 or bias adjustments 104-107 -which adjust both the distributions of the individual variables of interest and their statistical coupling-should be considered 104,108 . Correcting SMILEs can be challenging 80,104 ; however, there are recent developments that provide bias-adjusted large ensembles 80,109 (e.g., the CanLEADv1 110 ), as well as dynamically downscaled SMILEs 27 (i.e., SMILE from regional climate models that should better represent local processes). Building on a perfect-model approach 24,75 , SMILEs could also be used to assess the skill of multivariate bias-adjustment methods 104 .
In general, running impact models for many SMILEs can be challenging due to the computational costs of some impact models and the need of bias adjusting climate models 111 . In this context, storylines derived from SMILEs can be used to explore and communicate a wide range of plausible future climates over a period of, for instance, 30 years (via climate storylines, Fig. 6), as well as the consequences of individual worst-case events (via event-based storylines, Fig. 7). The selection of such storylines should be based on a metric that takes into account the combinations of drivers that cause impacts to a given sector of interest 72 , hence the choice of the metric will depend on the sector and should be guided by impact modellers. For example, exploring impacts associated with a year characterised by widespread droughts over breadbasket regions (event-based storyline) may allow testing different adaptation strategies related to, for instance, irrigation, therefore supporting decision-making 1,2,66,68 , whereas climate storylines derived based on frequencies of cold spells occurring together with low winds might be explored for the renewable energy sector 112 . Overall, employing a limited set of storylines as input in impact models can allow for efficient exploration of a substantial range of possible impacts, therefore supporting climate risk assessments 65,72 . Given the cost of dynamically downscaling full SMILEs, downscaling single events or ensemble members can provide an effective approach for ultimately driving impact models with high-resolution climate data 72 .
In our illustrations, we employed seven CMIP5 generation global SMILEs 24 (each providing 16-100 ensemble members, see Table 1). However, new generation CMIP6 outputs that include more SMILEs are now becoming available (see https://www.cesm.ucar.edu/projects/ community-projects/MMLEA/ for a current overview). The substantial reduction of uncertainties in compound event frequencies with the increase in the sample size (Fig. 2) indicates that the many CMIP6 models that are now available with several ensemble members (16 models with at least 10 members), will offer new opportunities for studying different types of compound events. Considering such a variety of models will enable a more comprehensive study of uncertainties from model-to-model differences in climate risk assessments. Furthermore, new SMILEs from regional climate models have also started to increase in number and can allow for comprehensive regional analyses by combining regional and global SMILEs 27 . The increase in computational resources should eventually enable the generation of high-resolution SMILEs at the global scale. We would like to note that alternatives to SMILEs exist. For example, climate emulators 113,114 simulate large sample sizes at a low-computational cost; however, despite some exceptions 88 , they are typically not yet tailored to compound events. Furthermore, combining our understanding of physical processes with intelligent use of methods can also help reduce required sample sizes and thus render single-member model simulations more useful. For instance, using approaches termed dynamical adjustments allows estimating the forced response of climate change on various quantities of interest under certain assumptions with comparably small sample sizes 115,116 .
Based on a number of illustrative analyses (Table 1), we conclude that assessing compound events and their changes based on datasets of limited sample size may be misleading. Observational data, particularly when not characterised by a very small sample size, remain paramount for analysing extreme events. Nevertheless, climate models are essential tools for studying the dynamics of the most extreme events and for estimating future changes. Hence, given the increasing availability of SMILEs, we argue they should be preferred over singleensemble model simulations to provide stakeholders with wellinformed information on future climate risks associated with compound events.

Compound events
Given a variable and a model, extremes were defined as values exceeding a percentile-based threshold of the variable's distribution obtained by pooling together data of the historical period from all available ensemble members of the model. Compound event frequencies were computed empirically via counting 26,53,94 .
Compound hot-dry events were defined as high mean temperature (above the 90th percentile) and low mean precipitation (below the 10th percentile) values over the warm season (the three consecutive months with highest mean temperature during 1950-1980) 26 . In Fig. 4, to quantify historical and future Spearman correlation between mean temperature and precipitation, for each ensemble member and grid point, we first removed linear trends in the two variables 53 (detrending different ensemble members independently).
To study multi-year droughts, we defined droughts as annual average soil moisture over the total column below the 20th percentile (model MPI-GE). To assess concurrent droughts across soybean growing regions, we used the five soybean regions from Gaupp et al. 13 ; here, a drought was defined when the regionally averaged annual soil moisture was below the 25th percentile. For droughts and compound hot-dry events, data were bilinearly interpolated to a 2.5°spatial grid before all calculations.
For concurrent wind and precipitation extremes, we considered historical daily data of accumulated precipitation and mean wind speed on the native model grid (model CESM1-CAM5). Frequencies of concurrent daily extremes (f PW ) are based on precipitation and wind extremes defined as values occurring twice a year on average (i.e., values above their 100 ⋅ (1-2/365)th percentiles).
Overall, we note that considering more extreme thresholds to define compound events (e.g., the 16th percentile to define a soil moisture drought) would result in higher relative uncertainties in the frequency of compound events than those shown in Fig. 2.

Ensemble mean and uncertainty quantification
Each of the seven SMILEs consists of several simulations starting from different initial conditions, resulting in multiple ensemble members that span a range of plausible climates and associated distributions for any quantity of interest X. Given a SMILE, we obtained robust estimates of X as the average of X from individual ensemble members (ensemble mean). Furthermore, when X is a projected change, e.g., the change in the frequency of compound events (Δf), the forced response of f in the considered SMILE is the ensemble mean of Δf.  26 , we quantified the uncertainty in the statistical quantity X in a single realisation due to internal climate variability as twice the standard deviation of X from individual ensemble members of a given SMILE model (2 ⋅ U IV ). In Fig. 6a, b, where we consider multiple SMILE models to compute such an uncertainty due to internal climate variability, we obtain an average of U IV across models as where N mod = 7 is the number of models, and σ(X s ) is the standard deviation of X from individual ensemble members of the model s. In Fig. 6a, b, we show the uncertainty in the future f HD due to model-to-model differences (U MD ) relative to the sum of U MD and U IV . To this end, we quantified U MD based on the standard deviations of the ensemble-mean of f HD in the seven SMILEs 26 .

Dependence of uncertainty on sample size
In Fig. 2, following the procedure of Bevacqua et al. 26 , we computed the ratio of twice the standard deviation to the mean of f (i.e., 2 Á U IV =f ) across N ens = 12 ensemble members (for compound precipitation-wind extremes we use N ens = 30) of various fixed sample size N years . The N ens members were obtained via sampling data from pooled concatenated historical ensemble members of the SMILE considered to analyse the compound event of interest. When assessing (1) compound hot-dry events and (2) precipitation-wind extremes, we randomly sampled calendar years 54 from the concatenated data. For assessing (3) droughts over, e.g., two consecutive years, we first modified each ensemble member simulation by creating a dataset with pairs of all two consecutive soil moisture annual values (we disregarded the first year of each ensemble member for which no previous soil moisture value exists). Then, we concatenated these modified ensemble members. Finally, we sampled the N ens ensemble members taking pairs from the pooled concatenated data. To obtain dashed and dotted lines in Fig. 2a, we first computed the Spearman correlation between temperature and precipitation (based on all pooled ensemble members) and then selected the grid points with the 35% strongest and weakest coupling (i.e., strongest and weakest negative correlation coefficient), respectively.

Required sample size for attribution
Extending the approach of Zscheischler and Lehner (2022) 43 , based on synthetic data of two variables X and Y with a given Pearson correlation cor XY , we quantified-in the presence of hypothetical combinations of anthropogenically forced positive trends Trend X and Trend Y in the two variables-the minimum sample size required for attribution of univariate high extremes and concurrent high extremes. We repeated the experiment three times, for different values of the correlations cor XY (−0.5, 0, and +0.5). Note that we assume no trends in the compound drivers' dependency 43 . Each experiment was carried out in three steps: (1) Based on a bivariate standard Gaussian distribution with a given correlation cor XY , we simulated N pairs (X, Y) both for a reference period representing hypothetical preindustrial conditions, and for a hypothetical present-day period under forced trends in the variables (via modifying the means of the distribution of the two variables by Trend X and Trend Y ). To identify the sample size N required for attribution, such simulations were conducted for a set of N ranging from 5 to 3000 and repeated N bootstrap = 5000 times each. (2) For each of the N bootstrap simulations associated with each sample size N, extreme events of individual variables were defined as values above the 90th percentile of the reference period. The probability ratio PR = p 1 /p 0 was computed, where p 0 is the empirically estimated probability of an extreme event occurring in the reference period simulations and p 1 is that probability for the present-day period. PRs were computed for the two univariate extreme events and concurrent extremes. (3) We identified the minimum sample size N for which the attribution is robust, defined as a sample size for which the 95% confidence interval of PR is greater than one in 90% of the N bootstrap simulations, where the confidence interval was estimated following Zscheischler and Lehner (2022) 42,43 .

Event-based storylines
Compound events are found in a multidimensional space, for which, in contrast to a univariate space, there is no unique natural ordering, therefore identifying the most extreme compound events requires employing a metric whose extreme values are associated with meaningful events from an impact perspective. Here, we use a copula-based approach to select the most extreme events 77,78 . To identify extreme compound wintertime (December-February) precipitation and wind events, we first defined, for all pooled members of CESM1-CAM5, the variables X 1 and X 2 as the spatially-weighted average of daily precipitation and wind, respectively, over Portugal (Fig. 7). Secondly, through the R-package copula 124 , we computed the bivariate empirical copula C of these (X 1 , X 2 ), that is the empirical joint cumulative distribution function 125 of (U 1 = F 1 (X 1 ), U 2 = F 2 (X 2 )), where F i is the empirical cumulative distribution function of X i . The pairs (X 1 , X 2 ) associated with the highest copula values (and therefore the highest probability Pr(X 1 ≤ x 1 , X 2 ≤ x 2 )) are then selected as the most extreme events. To identify extreme compound droughts across soybean breadbasket regions, we employed the same conceptual approach, but extended it to five dimensions and used data from the model MPI-GE. That is, we considered five variables (X 1 , X 2 , . . . , X 5 ) that were defined as the negative values (given the interest in their extremely low values) of the breadbasket-regionally averaged annual soil moisture and computed the associated five-dimensional empirical copula C. Finally, the combinations (X 1 , …, X 5 ) associated with the highest copula values were selected as the most extreme events.
Perfect-model approach for evaluating a statistical model of spatially compounding droughts For Fig. 8, we used the breadbasket regionally averaged annual soil moisture data for the historical period. We tested how a multivariate statistical model fitted to pseudo-observations with a limited sample size (here, 31 years from the first member) represents the compound event statistics of the underlying climate defined by pooling all available ensemble members (31 ⋅ 100 years). Following the modelling of Bevacqua et al. 20 , the multivariate statistical model consists of a five-dimensional probability density function (pdf) of the variables (X 1 , X 2 , . . . , X 5 ) (the soil moisture in the five breadbasket regions). The pdf is decomposed into the five marginal distributions of X i and the copula density c, where c accounts for the dependence amongst the variables X i regardless of their marginal distributions and was modelled via pair copula constructions (PCCs) 20 . Firstly, marginal distributions were fitted through a kernel density estimate 89 (via the stats R-package 126 , function density). Then, we fitted pair-copula families and selected the best PCC structure via the function RVineStructureSelect (based on AIC) of the VineCopula R-package 127 . Copulas were fitted to the variables U i = F i (X i ), where F i is the empirical cumulative distribution function of X i . The resulting statistical model, i.e. the fitted 5-dimensional pdf, is associated with a unique probability of simultaneous droughts across N-regions, which is, however, affected by sampling uncertainty because the pdf is fitted to a sample of finite size. We estimated the sampling uncertainty via standard parametric bootstrap similar to previous studies 20,21,39 , in four steps. (1) We simulated N bootstrap = 900 samples of (X 1 , X 2 , . . . , X 5 ) with the same length as the original data (i.e., 31 years). This was achieved by simulating the variables (u 1 , u 2 , . . . , u 5 ) and transforming them to (x 1 , x 2 , . . . , x 5 ) via the inverse of the marginal kernel densities. (2) We fitted a pdf model to each of the N bootstrap samples via the same procedure outlined above.
(3) From each of these N bootstrap models, we simulated a sample of (X 1 , X 2 , . . . , X 5 ) with a length of 1000 times the original, such as to robustly obtain N bootstrap probabilities of simultaneous droughts across N-regions associated with each pdf. (4) Finally, we computed the centred 95% confidence interval of such N bootstrap probabilities, i.e. the final sampling uncertainty in the probability of simultaneous droughts across N-regions (shown via grey boxplots in Fig. 8 along with an overlaid black line indicating the mean of the N bootstrap probabilities).
Finally, we note that a user may be interested in carrying out the analyses considering different ensemble members of a given SMILE as pseudo-observations (or different SMILEs to carry the overall analysis).

Code availability
All custom codes are direct implementations of standard methods and techniques, described in detail in Methods. The code used to prepare the figures in this paper is available from the corresponding author upon reasonable request.